
According to Eq. 4.6, the variance in the anti-bandit algorithm is bounded, and the lower/upper confidence bounds can be estimated as
$$\tilde{r}_k - \sqrt{\frac{2\ln N}{n}} \;\le\; r_k \;\le\; \tilde{r}_k + \sqrt{\frac{2\ln N}{n}}. \qquad (4.8)$$

4.2.2 Search Space

Following [307, 151, 291], we search for computation cells as the building blocks of the

final architecture. A cell is a fully connected directed acyclic graph (DAG) of M nodes, i.e.,

{B1, B2, ..., BM} as shown in Fig. 4.13. Here, each node is a specific tensor (e.g., a feature

map in convolutional neural networks), and each directed edge (i, j) between Bi and Bj

denotes an operation $o^{(i,j)}(\cdot)$, which is sampled from $\Omega^{(i,j)} = \{o^{(i,j)}_1, \dots, o^{(i,j)}_K\}$. $\{\Omega^{(i,j)}\}$ is the search space of a cell. Each node $B_j$ takes its dependent nodes as input and is obtained by $B_j = \sum_{i<j} o^{(i,j)}(B_i)$. The constraint $i < j$ avoids cycles in a cell. Each cell takes

the output of the last cell as input. For brevity, we denote by B0 the last node of the

previous cell and the first node of the current cell. Unlike existing approaches that use only

normal and reduction cells, we search for v (v > 2) cells instead. For general NAS search, we

follow [151] and take seven normal operations, i.e., 3×3 max pooling, 3×3 average pooling,

skip connection (identity), 3×3 dilated convolution with rate 2, 5×5 dilated convolution with rate 2, 3×3 depth-wise separable convolution, and 5×5 depth-wise separable convolution. Considering

adversarially robust optimization for NAS, we introduce two additional operations, the 3×3

Gabor filter and denoising block, for model defense. Therefore, the size of the entire search space is $K^{|\mathcal{E}_M| \times v}$, where $\mathcal{E}_M$ is the set of possible edges in the fully connected DAG with $M$ intermediate nodes. In the case with $M = 4$ and $v = 6$, together with the input node, the total number of cell structures in the search space is $9^{(1+2+3+4)\times 6} = 9^{10\times 6}$.
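As an illustration of the cell computation $B_j = \sum_{i<j} o^{(i,j)}(B_i)$, the following is a minimal PyTorch-style sketch of a fully connected DAG cell with sampled operations; the class and argument names are our own placeholders, not the actual search code.

```python
import torch
import torch.nn as nn

class Cell(nn.Module):
    """Fully connected DAG cell with intermediate nodes B1..BM.

    sampled_ops[(i, j)] is the operation chosen for edge (i, j); in the
    search it would be drawn from the K candidates of Omega^(i,j).
    """
    def __init__(self, num_nodes, sampled_ops):
        super().__init__()
        self.num_nodes = num_nodes
        self.ops = nn.ModuleDict(
            {f"{i}->{j}": op for (i, j), op in sampled_ops.items()}
        )

    def forward(self, b0):
        nodes = [b0]  # B0: the last node of the previous cell
        for j in range(1, self.num_nodes + 1):
            # B_j = sum_{i<j} o^(i,j)(B_i); the constraint i < j avoids cycles
            nodes.append(sum(self.ops[f"{i}->{j}"](nodes[i]) for i in range(j)))
        return nodes[-1]  # fed to the next cell as its B0

# Hypothetical usage with M = 2 and identity operations on every edge:
ops = {(i, j): nn.Identity() for j in range(1, 3) for i in range(j)}
cell = Cell(num_nodes=2, sampled_ops=ops)
out = cell(torch.randn(1, 16, 8, 8))
```

Here, we briefly introduce the two additional operations.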

Gabor filter. Gabor filters [69, 68] containing frequency and orientation representations

can characterize the spatial frequency structure in images while preserving spatial relation-

ships. This operation provides superb robustness for the network [191]. Gabor filters are defined as $\exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \cos\!\left(2\pi \frac{x'}{\lambda} + \psi\right)$, where $x' = x\cos\theta + y\sin\theta$ and $y' = -x\sin\theta + y\cos\theta$.

σ, γ, λ, ψ, and θ are learnable parameters. Note that the symbols used here apply only

to the Gabor filter and are different from the symbols used in the rest of this chapter.

Figure 4.2(b) shows an example of Gabor filters.
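For intuition, the snippet below is a minimal NumPy sketch that evaluates the Gabor formula above on a 3×3 grid, assuming fixed rather than learnable parameters; the function name and default values are illustrative.

```python
import numpy as np

def gabor_kernel(size=3, sigma=1.0, gamma=0.5, lam=2.0, psi=0.0, theta=np.pi / 4):
    """Gabor kernel from the formula in the text.

    sigma: Gaussian envelope width, gamma: spatial aspect ratio,
    lam: wavelength, psi: phase offset, theta: orientation.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_r = x * np.cos(theta) + y * np.sin(theta)    # rotated coordinate x'
    y_r = -x * np.sin(theta) + y * np.cos(theta)   # rotated coordinate y'
    return np.exp(-(x_r**2 + gamma**2 * y_r**2) / (2 * sigma**2)) \
        * np.cos(2 * np.pi * x_r / lam + psi)

print(gabor_kernel().round(3))
```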

Denoising block. As described in [253], adversarial perturbations on images will intro-

duce noise in the features. Therefore, denoising blocks can improve adversarial robustness by

denoising features. Following this, we add the nonlocal mean denoising block [22] as shown

in Fig. 4.2(c) to the search space to denoise the features. A denoised feature map $z$ is computed from an input feature map $x$ by taking a weighted mean over all spatial locations $\mathcal{L}$ of the features:
$$z_p = \frac{1}{\mathcal{C}(x)} \sum_{q \in \mathcal{L}} f(x_p, x_q) \cdot x_q,$$
where $f(x_p, x_q)$ is a feature-dependent weighting function and $\mathcal{C}(x)$ is a normalization function.
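The following is a minimal sketch of this non-local mean, assuming the embedded dot product for $f(x_p, x_q)$ and a softmax as the normalization $\mathcal{C}(x)$, one common instantiation in [22]; the 1×1 convolution and residual connection of the full block are omitted.

```python
import torch
import torch.nn.functional as F

def nonlocal_mean(x):
    """Non-local mean over all spatial locations L of a feature map.

    x: (batch, channels, height, width). For each location p,
    z_p = (1/C(x)) * sum_q f(x_p, x_q) * x_q, with the dot product
    as f and a softmax over q as the normalization C(x).
    """
    b, c, h, w = x.shape
    feats = x.view(b, c, h * w)                     # flatten spatial locations
    attn = torch.bmm(feats.transpose(1, 2), feats)  # f(x_p, x_q) for all p, q
    attn = F.softmax(attn, dim=-1)                  # 1/C(x) normalization
    z = torch.bmm(feats, attn.transpose(1, 2))      # weighted mean over q
    return z.view(b, c, h, w)

# Hypothetical usage: denoise a random 8x8 feature map with 16 channels.
z = nonlocal_mean(torch.randn(2, 16, 8, 8))
```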

4.2.3 Anti-Bandit Strategy for NAS

As described in [274, 291], the validation accuracy ranking of different network architectures

is not a reliable indicator of the final quality of the architecture. However, the experimental

results suggest that if an architecture performs poorly at the beginning of training, there

is little hope that it can be part of the final optimal model [291]. As training progresses, this observation becomes more and more certain. On the basis of this observation, we derive a simple yet effective training strategy. During training, as the epochs increase,